An objective prior that unifies objective Bayes and information-based inference.

There are three principal paradigms of statistical inference: (i) Bayesian, (ii) information-based and (iii) frequentist inference [1, 2]. We describe an objective prior (the weighting or w-prior) which unifies objective Bayes and information-based inference. The w-prior is chosen to make the marginal probability an unbiased estimator of the predictive performance of the model. This definition has several other natural interpretations. From the perspective of the information content of the prior, the w-prior is both uniformly and maximally uninformative. The w-prior can also be understood to result in a uniform density of distinguishable models in parameter space. Finally we demonstrate that inference with the w-prior is equivalent to the Akaike Information Criterion (AIC) for regular models in the asymptotic limit. The w-prior appears to be generically applicable to statistical inference and is free of ad hoc regularization. The mechanism for suppressing complexity is analogous to AIC: model complexity reduces model predictivity. We expect this new objective-Bayes approach to inference to be widely applicable to machine-learning problems including singular models.

PACS numbers:


Introduction. A long-standing goal of statistical inference is the formulation of a consistent and generally-applicable objective-Bayes methodology [3]. Although the Bayes law formally depends on knowledge of a prior probability distribution (prior), from its infancy, Bayesian analysis has been applied in scenarios where this prior information is unknown [4, 5]. A second principal paradigm of inference has been developed around a frequentist formulation of probability, beginning with the work of C. F. Gauss, R. A. Fisher, etc. [6]. In the 1970s, a third paradigm of information-based inference was developed based on the pioneering work of Akaike [7-9]. Although it is often possible to arrive at similar conclusions using these distinct statistical paradigms [10], this is not always the case [4, 11].

Summary. We motivate the form of the w-prior using the Principle of Indifference. We propose a precise implicit definition of the w-prior by defining a Bayes predictive multiplicity: the number of indistinguishable models in the vicinity of parameterization $\theta$. The w-prior has unit multiplicity for all model parameterizations. We demonstrate that the resulting Bayes partition function is an unbiased estimator for the predictive performance of the learning machine. Next we show that the multiplicity can be understood as the parameter-coding information and, using this formulation, we demonstrate that the w-prior is both uniformly and maximally uninformative.

Having established that the w-prior has many of the desired properties of an objective prior, we explore the connection to the information-based paradigm of statistical inference. We demonstrate that for regular models [41], objective-Bayes inference is asymptotically equivalent to the Akaike Information Criterion (AIC) [7-9], unifying objective-Bayes and information-based inference. Finally, we discuss predictivity as the unifying principle that is common to all three principal paradigms of inference.

Preliminaries. We introduce the following notation for a set of $N$ independent and identically distributed observations [42]:

$$X^N \equiv \{X_1, X_2, \ldots, X_N\} \quad \text{where} \quad X_i \sim q(\cdot|\theta_0), \tag{1}$$

and the distribution function $q$ is parameterized by $\theta_0$, the true parameterization. We model the probability distribution with a set of model parameterizations $\theta \in \Theta$. In general this set will include models of different dimension (i.e. complexity $K = \dim \theta$). We assume a frequentist realization of Bayesian statistics: There are many realizations of the system of interest and the true parameterization is a random variable with a distribution defined by the true prior distribution, $\theta_0 \sim \varpi_0(\cdot)$. This true prior may or may not be known.

We introduce a generalized marginal probability called the Bayes partition function [12]:

$$Z(X^N|\rho) \equiv \int_\Theta d\theta\, \rho(\theta)\, q(X^N|\theta), \tag{2}$$

where $\rho(\theta)$ is a density of models on the manifold $\Theta$ that need not be normalized. When $\rho = \varpi_0$, the marginal probability (partition function) has the meaning of the probability of observing $X^N$ for an unknown true parameterization.

The generalized posterior probability is defined:

$$\rho(\theta|X^N) \equiv \frac{q(X^N|\theta)\, \rho(\theta)}{Z(X^N|\rho)}, \tag{3}$$

which is known as the Bayes Law if $\rho = \varpi_0$. Like the partition function, we generalize the posterior to permit updating of any distribution on $\Theta$. In this generalized case, the posterior is understood to have the meaning of a model weighting. Only when the posterior is constructed with the true prior is the posterior understood as the probability distribution for the unknown true parameterization, updated by the observations $X^N$. Note that even if the prior is initially improper (unnormalized), the posterior will be normalized, provided the partition function is finite.

The Bayes predictive distribution is defined:

$$q(X|X^N,\rho) \equiv \frac{Z(X,X^N|\rho)}{Z(X^N|\rho)}, \tag{4}$$

where $X \notin X^N$ and is understood to be the Bayesian predicted probability distribution for a new observation $X$ given $N$ observations $X^N$ and a prior $\rho$.
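As a concrete illustration of Eqns. 2-4, the following minimal sketch evaluates the partition function, the generalized posterior and the predictive distribution by a Riemann sum. The model (a unit-variance Gaussian with unknown mean), the flat improper density, the grid bounds and all function names are assumptions chosen for illustration; none of them is taken from the text.

```python
import math

def log_likelihood(xs, mu):
    """log q(X^N | mu) for i.i.d. unit-variance Gaussian observations."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in xs)

def make_grid(lo, hi, n):
    """Uniform grid of n points on [lo, hi] (hypothetical integration range)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def partition_function(xs, rho, grid):
    """Z(X^N | rho) of Eq. 2, approximated by a Riemann sum over the grid."""
    d_mu = grid[1] - grid[0]
    return sum(rho(mu) * math.exp(log_likelihood(xs, mu)) for mu in grid) * d_mu

def posterior_weights(xs, rho, grid):
    """Generalized posterior rho(mu | X^N) of Eq. 3, evaluated on the grid."""
    z = partition_function(xs, rho, grid)
    return [rho(mu) * math.exp(log_likelihood(xs, mu)) / z for mu in grid]

def predictive(x_new, xs, rho, grid):
    """Bayes predictive q(X | X^N, rho) of Eq. 4 as a ratio of partition functions."""
    return partition_function(list(xs) + [x_new], rho, grid) / partition_function(xs, rho, grid)

def flat_density(mu):
    """Improper (unnormalized) flat density of models, used only as an example."""
    return 1.0

if __name__ == "__main__":
    data = [0.3, -0.1, 0.5, 0.2]
    grid = make_grid(-10.0, 10.0, 2001)
    print(predictive(0.0, data, flat_density, grid))
```

Although the flat density is improper, the posterior weights integrate to one on the grid, and for this particular model the predictive distribution is a Gaussian centered on the sample mean with variance $1 + 1/N$.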

We define the Bayesian free energy [12]:

$$G(X^N,\rho) \equiv -\log Z(X^N|\rho) \tag{5}$$

and the closely related performance estimator:

$$\mathcal{L}_{\rm post}(\theta_0,\rho) \equiv \mathop{E}_{X^N \sim q(\cdot|\theta_0)} \log Z(X^N|\rho). \tag{6}$$

The significance of this definition and its interpretation as an estimator will be discussed shortly.

Finally we introduce the natural measure of predictive performance: the ability of the trained Bayes model to predict new observations. Again it is convenient to formulate the performance of the model for $N$ new observations. We define the predictivity:

$$\mathcal{L}_{\rm pre}(\theta_0,\rho) \equiv N \mathop{E}_{X,\,X^N \sim q(\cdot|\theta_0)} \log q(X|X^N,\rho), \tag{7}$$

where $X \notin X^N$. The predictivity is the Bayesian predictive performance for $N$ simultaneous measurements.

[Figure 1. Panel titles: No Measure; Target Space; w-prior Measure. Plotted distributions: $q(X|\theta)$ and $q(X^N|\theta)$.]

FIG. 1: Geometry of inference. Panel A: Inference defines a natural measure on parameter space $\Theta$. To understand the origin of this measure, consider a true parameterization $\theta_0$ used to generate observations $X^N$. We then define a cutoff divergence $D_D$ to determine a region of indistinguishability on $\Theta$ with volume $V_D(\theta)$. We define the density of models as $\rho_D \equiv V_D^{-1}$.

Principle of Indifference. In instances where no prior information was known, both P. S. Laplace [5] and T. Bayes [13] invoked a principle of indifference which assigned mutually exclusive and exhaustive possibilities equal prior probability [14, 15]. Although this approach does lead to consistent results in the context of models with discrete parameterization, it has long been unclear how to generalize the principle of indifference to a continuum context where the meanings of both mutually exclusive and exhaustive are uncertain.

The prior depends on N. In the interest of specificity, consider the set of Gaussian distributions with mean $\mu \in \mathbb{R}$ and variance $v = 1$. We can write $\theta = (\mu, v)$. Are two models with $\mu_1 \neq \mu_2$ mutually exclusive? Intuitively we know the answer: If the difference in the means is sufficiently small, we cannot distinguish the models for a small number of observations. But, whatever the values of the means $\mu_i$, if $\mu_1 \neq \mu_2$ then a sufficient number of observations $N$ can always resolve the models, rendering them mutually exclusive. Therefore it is natural to assume that the condition for model distinguishability depends on the number of observations.

Estimating the density of models. V. Balasubramanian and others [16] have argued that a natural measure for testing distinguishability for $\theta_1 \neq \theta_2$ exists and is given by the KL Divergence [17, 18]:

$$D(\theta_1\|\theta_2) \equiv \int dx\, q(x|\theta_1) \log \frac{q(x|\theta_1)}{q(x|\theta_2)}, \tag{8}$$

which is a (non-symmetric) measure of distance between any two probability distributions. For $\theta_1 = \theta_2$, the divergence is identically zero and $D > 0$ for $q(x|\theta_1) \neq q(x|\theta_2)$ [17, 18].

For simplicity, assume that we pick some critical value of the divergence $D_D$, below which two distributions are considered indistinguishable ($D < D_D$) and above which two distributions are distinguishable ($D > D_D$, or mutually exclusive). The sum over equally weighted mutually exclusive distributions can be written as an integral [16]:

$$\int_\Theta d\theta\, \rho_D(\theta), \tag{9}$$

where $\rho_D^{-1} \equiv V_D$ is the parameter-space volume of parameterizations that are equivalent in the vicinity of $\theta$. (For the moment, assume all $\theta \in \Theta$ are continuous parameters of dimension $K$. A schematic drawing of the meaning of the parameter-space volume is illustrated in Fig. 1.) The divergence for $N$ observations is simply $N$ times the divergence for one. For a large number of observations $N$ and a regular model, we can write the density of distinguishable models as [16]:

$$\rho_D \propto V_D^{-1} \propto N^{K/2} \sqrt{\det I}, \tag{10}$$

where $I$ is the Fisher Information Matrix [19]. Clearly this density is increasing in $N$ since the resolution of the learning machine increases with the number of observations. The canonical approach is to drop the $N$ dependence, keeping only the determinant of the Fisher Information Matrix [16]. But for the purpose of counting distinguishable models with distinct dimension $K$, it is essential to retain the $N$ dependence since it cannot be factored out of the sum in Eqn. 9.
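The dependence of distinguishability on $N$ can be illustrated numerically for the unit-variance Gaussian example discussed above. The sketch below is a minimal illustration, assuming a hypothetical cutoff `d_crit` for the critical divergence and hypothetical function names; it checks the KL divergence by direct integration and then uses the $N$-fold additivity of the divergence to count observations.

```python
import math

def log_gauss(x, mu):
    """Log density of a unit-variance Gaussian with mean mu."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

def kl_divergence(mu1, mu2, lo=-20.0, hi=20.0, n=40001):
    """Riemann-sum estimate of the KL divergence between N(mu1, 1) and N(mu2, 1)."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        lq1 = log_gauss(x, mu1)
        total += math.exp(lq1) * (lq1 - log_gauss(x, mu2))
    return total * dx

def observations_to_distinguish(mu1, mu2, d_crit):
    """Smallest N with N * D >= d_crit, using D = (mu1 - mu2)^2 / 2 for this model."""
    d_one = 0.5 * (mu1 - mu2) ** 2
    return math.ceil(d_crit / d_one)

def density_of_models(n_obs, d_crit):
    """Inverse spacing of just-distinguishable means; proportional to N^(1/2) for K = 1."""
    spacing = math.sqrt(2.0 * d_crit / n_obs)
    return 1.0 / spacing

if __name__ == "__main__":
    print(kl_divergence(0.0, 1.0))
    print(observations_to_distinguish(0.0, 0.1, d_crit=1.0))
    print(density_of_models(400, 1.0) / density_of_models(100, 1.0))
```

In this one-parameter example, quadrupling the number of observations doubles the density of distinguishable models, which is the $N^{K/2}$ scaling with $K = 1$. The absolute density still depends on the arbitrary cutoff.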

We will make a more precise definition of the density of models shortly, but the insistence that the prior (the density of models) depends on the number of observations $N$ has profound consequences as we describe below.

Note that for fixed model dimension $K$, we will show that Eqn. 10 is correct for a specific value of the critical divergence $D_D$. One might hope that inference was either independent or weakly dependent on $D_D$ but in fact its value is critically important. We therefore need to propose a more precise definition of the density of models.

The multiplicity and the w-prior. To define the density of models precisely, we first introduce the Bayes predictive multiplicity $m$:

$$\log m(\theta_0,\rho) \equiv \mathcal{L}_{\rm post}(\theta_0,\rho) - \mathcal{L}_{\rm pre}(\theta_0,\rho). \tag{11}$$

The meaning of the multiplicity is the number of indistinguishable models defined by the prior in the vicinity of $\theta_0$. (We will return to this discussion shortly.)

We define the weighting prior (w-prior) as the density of indistinguishable models such that the multiplicity is unity for all true models in the set $\Theta$:

$$m(\theta_0, w) = 1 \quad \forall\ \theta_0 \in \Theta. \tag{12}$$

Clearly $w$ will in general be an improper prior. The name "weighting" is chosen to emphasize the model-weighting rather than probabilistic interpretation of the analysis, as described below [8].

An unbiased estimator of performance. For an improper prior, the partition function can no longer be understood as the marginal probability of observations $X^N$. But the definition of the w-prior gives the partition function a precise mathematical meaning: Eqn. 11, evaluated at the w-prior (Eqn. 12), implies that the performance estimator (Eqn. 6) and by extension the log partition function (Eqn. 2) are unbiased estimators [8, 20] of the predictive performance of the model (Eqn. 7). Therefore the partition function is understood as the probability of the last $N$ measurements in the subjective-Bayes framework, but as an estimate of the probability of the next $N$ in the objective-Bayes framework.
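A worked example may make the unbiasedness claim concrete. The sketch below assumes a unit-variance Gaussian with unknown mean $\mu$ (so $K = 1$ and Fisher information $I = 1$) and takes the regular-model form of the w-prior from Eqn. 21 in its asymptotic form, $w = (N/2\pi)^{1/2} e^{-1}$, which is a constant in $\mu$. The sample mean $\bar X$ and the residual sum of squares $S$ are notation introduced here for illustration only.

```latex
\begin{align*}
Z(X^N|w) &= \int d\mu\; w \prod_{i=1}^{N} \frac{e^{-(X_i-\mu)^2/2}}{\sqrt{2\pi}}
          = (2\pi)^{-N/2}\, e^{-S/2}\, e^{-1},
  \qquad S \equiv \sum_{i=1}^{N} (X_i-\bar X)^2, \\
\mathcal{L}_{\rm post} &= E \log Z
          = -\tfrac{N}{2}\log 2\pi - \tfrac{N-1}{2} - 1, \\
q(X|X^N,w) &= \mathcal{N}\!\left(X;\ \bar X,\ 1+\tfrac{1}{N}\right), \\
\mathcal{L}_{\rm pre} &= N\, E \log q(X|X^N,w)
          = -\tfrac{N}{2}\log 2\pi - \tfrac{N}{2}
            - \tfrac{N}{2}\log\!\left(1+\tfrac{1}{N}\right), \\
\log m &= \mathcal{L}_{\rm post} - \mathcal{L}_{\rm pre}
          = -\tfrac{1}{2} + \tfrac{N}{2}\log\!\left(1+\tfrac{1}{N}\right)
          = -\tfrac{1}{4N} + \mathcal{O}(N^{-2}).
\end{align*}
```

Under these assumptions the log multiplicity vanishes to order $N^{-1}$, so $E \log Z$ matches the predictivity to that order. Repeating the same steps with a flat density $\rho = 1$ gives $m \approx e\sqrt{2\pi/N} = \rho/w$, consistent with reading $m$ as the number of indistinguishable models per unit of the w-prior.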

Connection to the Gibbs Entropy. To understand the mathematical meaning of the multiplicity $m$ (Eqn. 11), it is helpful to transform the equation to write it in terms of the posterior probability distribution. We analytically continue the number of observations to define the continuous variable effective temperature: $T \equiv N^{-1}$ (e.g. [12]). We can now identify the definition of the multiplicity as exactly analogous to the computation of the disorder-averaged Gibbs entropy $S$ (e.g. [21, 22]):

$$S(T,\rho,\theta_0) \equiv -\partial_T F = \log m + \mathcal{O}(T), \tag{13}$$

from the disorder-averaged Helmholtz free energy $F$:

$$F(T,\rho,\theta_0) \equiv -T \langle \log Z \rangle_{X^N}, \tag{14}$$

where the angle brackets represent the expectation over $X^N$ with respect to the true distribution parameterized by $\theta_0$ and the order-$T$ correction is an error due to the analytic continuation of the number of observations $N$ [23]. In a physical context, $X^N$ is quenched disorder [16]. The w-prior satisfies the condition that the disorder-averaged Gibbs entropy is zero to order $N^{-1}$:

$$S(\tfrac{1}{N}, w, \theta_0) = 0 + \mathcal{O}(\tfrac{1}{N}), \tag{15}$$

for all $\theta_0 \in \Theta$.
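The identification of the Gibbs entropy with the log multiplicity can be sketched in a few lines. The sketch assumes that the performance estimator of Eqn. 6, written here as $\mathcal{L}_{\rm post}(N)$ to make its dependence on sample size explicit, is a smooth function of $N$ with second derivative of order $N^{-2}$ (as holds for the asymptotic regular-model form $-Nh - \tfrac{K}{2}\log N + {\rm const}$, where $h$ is a per-observation constant introduced here for illustration), and that the density $\rho$ is held fixed when one observation is added.

```latex
\begin{align*}
F(T) &= -T\,\mathcal{L}_{\rm post}(N), \qquad N = 1/T, \\
S &= -\partial_T F
   = \mathcal{L}_{\rm post} + T\,\frac{dN}{dT}\,\partial_N \mathcal{L}_{\rm post}
   = \mathcal{L}_{\rm post} - N\,\partial_N \mathcal{L}_{\rm post}, \\
\mathcal{L}_{\rm pre} &= N\,E \log \frac{Z(X,X^N|\rho)}{Z(X^N|\rho)}
   = N\left[\mathcal{L}_{\rm post}(N+1) - \mathcal{L}_{\rm post}(N)\right]
   = N\,\partial_N \mathcal{L}_{\rm post} + \mathcal{O}(T), \\
S &= \mathcal{L}_{\rm post} - \mathcal{L}_{\rm pre} + \mathcal{O}(T)
   = \log m + \mathcal{O}(T).
\end{align*}
```

The order-$T$ remainder comes from replacing the finite difference in $N$ by a derivative, which is the analytic-continuation error referred to in the text.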

The meaning of multiplicity. Re-expressing Eqn. 13 in statistical quantities gives the following expression for the multiplicity:

$$\log m(\theta_0,\rho) = \mathop{E}_{X,Y \sim q(\cdot|\theta_0)}\ \mathop{E}_{\theta \sim \rho(\cdot|Y^N)} \log \frac{\rho(\theta)}{\rho(\theta|X^N)} + \mathcal{O}(\tfrac{1}{N}). \tag{16}$$

The meaning of the multiplicity is understood as follows: $\theta$ can be understood as the estimator of $\theta_0$. $\rho(\theta)$ is the density of models at $\theta$ and $V_D \sim \rho^{-1}(\theta|X^N)$ is the parameter-space volume of indistinguishable models. Therefore the multiplicity $m$ is the number of indistinguishable models defined by the prior $\rho(\theta)$. See Fig. 1 for a schematic illustration.

Although this expression is almost in the form of the divergence, the expectation is taken over the posterior probability distribution for an independent dataset $Y^N$. This cross-validation form is a key distinction and avoids the over-fitting phenomenon [43].

The parameter-coding interpretation. We now wish to re-interpret the multiplicity as a parameter-coding information. We define the parameter-coding information content of the observations:

$$H_\theta(\theta_0,\varpi) \equiv \mathop{E}_{X,Y \sim q(\cdot|\theta_0)}\ \mathop{E}_{\theta \sim \varpi(\cdot|Y^N)} \log \frac{\varpi(\theta|X^N)}{\varpi(\theta)}, \tag{17}$$

given a normalized prior $\varpi$. Note that, as before, we are careful to define the parameter-coding information by cross-validation, with posterior probability distributions generated from independent datasets $X^N$ and $Y^N$.

We now need to define a normalized w-prior. We integrate over all parameter space (applying a cutoff if necessary) to determine the normalization constant (total number of models):

$$N_\Theta \equiv \int_\Theta d\theta\, w(\theta). \tag{18}$$

We define the normalized w-prior: $\varpi_w \equiv w/N_\Theta$ [44].

Inserting the normalized w-prior into the parameter-coding information gives

$$H_\theta(\theta_0,\varpi_w) = \log N_\Theta + \mathcal{O}(N^{-1}), \tag{19}$$

which is clearly constant with respect to the true model $\theta_0$ up to order $N^{-1}$.

The w-prior is uniformly uninformative. We now wish to analyze the dependence of the average parameter-coding information $H_\theta$ on the true model $\theta_0$. Informative priors have the property that they code information about the underlying true model. A prior localized around the true value $\theta_0$ will reduce the information content of the observations. Therefore we might intuitively expect an uninformative prior to result in a constant information content with respect to the true model $\theta_0$. The w-prior has this property to order $N^{-1}$. No parameter values are favored or disfavored. We therefore say that the w-prior is uniformly uninformative since the information gain is independent of the true model.

The w-prior is maximally uninformative. In the definition of the reference prior, J. M. Bernardo argued that an objective and uninformative prior should maximize the information specified by the observations [24]. We therefore construct an expression analogous to that which he proposed, differing only in the use of the cross-validation form of the parameter-coding information:

$$\overline{H}_\theta(\varpi) \equiv \mathop{E}_{\theta \sim \varpi} H_\theta(\theta,\varpi), \tag{20}$$

which can be understood as the expectation of $H_\theta(\theta,\varpi)$ over a true prior $\varpi$. A prior which is maximally uninformative should be stationary with respect to variations in the prior. $\varpi_w$ is stationary and at least locally a maximum to order $N^{-1}$. We therefore conclude that the w-prior also possesses the property that it is maximally uninformative.

The w-prior for a regular model. In general, there is no closed-form expression of the w-prior although it can be computed exactly for a number of special cases and for sufficiently simple models [23].

In the interest of simplicity, consider a regular model where the number of continuous degrees of freedom in the model parameterization $\theta$ is $K$ and work in the limit of a large number of observations $N$. Using the Gibbs entropy equation for the w-prior, we compute the w-prior [23]:

$$w(\theta) = \left(\frac{N}{2\pi}\right)^{K/2} J(\theta)\, e^{-K}, \tag{21}$$

where

$$J(\theta) \equiv \sqrt{\det I}, \tag{22}$$

is equal to the square root of the determinant of the Fisher Information Matrix and is the well known Jeffreys prior that H. Jeffreys proposed to ensure invariance of the probability to re-parameterization of $\theta$ [25, 26]. It is instructive to immediately compare this result to the form we estimated based on the principle of indifference (Eqn. 10). These arguments correctly identified both the Jeffreys prior factor $J$ and the scaling with the number of observations $N$. But a precise formulation was required to correctly compute the last factor in Eqn. 21, the penalty $e^{-K}$, which plays a critical role in the regularization of the w-prior when the complexity of the model is unknown.

Clearly Eqn. 21 implies that for any significant number of observations $N$, the w-prior appears to increase with model complexity $K$, but all factors except for $e^{-K}$ cancel during marginalization over $\theta$ in the computation of the partition function (Eqn. 2). The qualitative understanding of the w-prior is therefore a penalization (regularization) of model complexity.
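The cancellation described above can be checked numerically in a simple case. The sketch below is a minimal illustration, assuming a unit-variance Gaussian with unknown mean ($K = 1$, Jeffreys factor $J = 1$) and assuming the asymptotic form $w = (N/2\pi)^{K/2} J e^{-K}$ for the w-prior; the function names and integration bounds are hypothetical.

```python
import math

def log_likelihood(xs, mu):
    """log q(X^N | mu) for i.i.d. unit-variance Gaussian observations."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in xs)

def w_prior(n_obs, k=1, jeffreys=1.0):
    """Assumed asymptotic w-prior for a regular model: (N / 2 pi)^(K/2) * J * exp(-K)."""
    return (n_obs / (2.0 * math.pi)) ** (k / 2) * jeffreys * math.exp(-k)

def free_energy(xs, lo=-10.0, hi=10.0, n=20001):
    """G = -log Z, with Z obtained by a Riemann sum of w * q(X^N | mu) over the mean."""
    d_mu = (hi - lo) / (n - 1)
    w = w_prior(len(xs))
    z = sum(w * math.exp(log_likelihood(xs, lo + i * d_mu)) for i in range(n)) * d_mu
    return -math.log(z)

def penalized_log_likelihood(xs, k=1):
    """Minus the maximized log likelihood plus the complexity penalty K."""
    mu_hat = sum(xs) / len(xs)
    return -log_likelihood(xs, mu_hat) + k

if __name__ == "__main__":
    data = [0.3, -0.1, 0.5, 0.2, 1.1]
    print(free_energy(data), penalized_log_likelihood(data))
```

For this Gaussian example the Laplace approximation is exact, so the two printed values agree up to the accuracy of the numerical integration: the factor $(N/2\pi)^{1/2}$ in the prior cancels against the posterior width, leaving only the $e^{-K}$ penalty.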

Equivalence to information-based inference. We now compute the w-prior for a regular model of unknown complexity $K$. Typically for models of unknown complexity, the w-prior cannot be computed exactly. We have therefore developed a recursive technique [23] analogous to that proposed by J. M. Bernardo [24]. We write the total model parameterization as $\theta = (K, \theta_K)$ where the $\theta_K$ are $K$ continuous parameters. The first-order expression for the w-prior for a regular model of unknown complexity is still given by Eqn. 21 with $\theta \rightarrow \theta_K$ [23]. The partition function using the first-order expression for the w-prior is:

$$Z = \sum_{K}^{\infty} Z_K, \tag{23}$$

$$G_K \equiv -\log Z_K = -\log q(X^N|K,\hat\theta_K) + K, \tag{24}$$

where the $\hat\theta_K$ are the Maximum Likelihood Estimators of the parameters $\theta_K$, and $Z_K$ and $G_K$ are the partition function and free energy at complexity $K$. The first term in the free energy is the minus-log likelihood and the second term is interpreted as a penalization for model complexity. To those familiar with the information-based approach of Akaike, it is clear that the free energy $G_K$ is identical to AIC (in the convention without the customary overall factor of 2):

$$G_K = {\rm AIC}_K, \tag{25}$$

for a model with $K$ degrees of freedom [7, 8]. Therefore, in the asymptotic limit for regular models, the w-prior will simply recover information-based inference [45]. The $K$ penalty is the information-based realization of Occam's Razor: parsimony implies predictivity.
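To illustrate how the free energies $G_K$ translate into model weights, the following minimal sketch compares two nested models for unit-variance Gaussian data: $K = 0$ (mean fixed at zero) and $K = 1$ (mean fitted by maximum likelihood). The choice of models, the function names and the normalization of the weights as $Z_K / \sum_K Z_K$ are assumptions made for illustration.

```python
import math

def log_likelihood(xs, mu):
    """log q(X^N | mu) for i.i.d. unit-variance Gaussian observations."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in xs)

def free_energies(xs):
    """G_K = -log q(X^N | K, theta_hat_K) + K for the two candidate models."""
    mu_hat = sum(xs) / len(xs)
    return {
        0: -log_likelihood(xs, 0.0),          # K = 0: no fitted parameters
        1: -log_likelihood(xs, mu_hat) + 1.0  # K = 1: fitted mean, penalty K = 1
    }

def model_weights(xs):
    """Weights proportional to Z_K = exp(-G_K), normalized over the candidate models."""
    g = free_energies(xs)
    g_min = min(g.values())  # subtract the minimum for numerical stability
    z = {k: math.exp(-(v - g_min)) for k, v in g.items()}
    total = sum(z.values())
    return {k: v / total for k, v in z.items()}

if __name__ == "__main__":
    print(model_weights([-1.0, 1.0]))            # sample mean is zero
    print(model_weights([2.0, 2.0, 2.0, 2.0]))   # sample mean far from zero
```

When the fitted mean brings no improvement in likelihood, the simpler model is favored by a factor of $e$ through the penalty alone; when the data lie far from zero, the gain in likelihood outweighs the penalty and the $K = 1$ model dominates.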

The w-prior for singular models. Singular models contain parameters for which the Fisher Information is zero (or nearly zero for finite $N$). AIC fails in the context of singular models but we have recently proposed a generalization of AIC called the Frequentist Information Criterion (FIC) for application to singular models [27, 28], which is equivalent to Neyman-Pearson hypothesis testing [27, 28]. Given the close connection between AIC and the w-prior described above, one might hope that the w-prior for a singular model would be analogous to FIC. The equivalence between information-based inference and the w-prior appears to hold only for regular models in the asymptotic limit ($N \rightarrow \infty$). It is well known that Bayesian inference, corresponding to the Schwartz distribution topology, has a finite generalization error in the asymptotic limit, whereas maximum-likelihood-based techniques result in divergent generalization errors in the asymptotic limit (e.g. [12]). We provide a detailed description of the connections between objective Bayes, information-based and frequentist inference elsewhere [23, 27].

The weighting interpretation. The w-prior and posterior probability distribution for the model parameterization $\theta$ should be understood as a model weighting, not as the probability density that $\theta$ is the true parameterization $\theta_0$. We have given the w-prior the name of weighting prior in close analogy to the Akaike weights (e.g. [8]). From a frequentist perspective, we are forbidden from discussing the probability of a model, which we cannot compute since we do not know the prior from which the truth was constructed. The weighting interpretation is not simply a philosophical point, but has important computational significance. For instance, as the number of observations $N$ increases, the resolution of the objective-Bayes learning machine increases and, as a consequence, the weighting of more complex models in the w-prior increases also. As a result, the complexity of the fit model naturally increases with the size of the dataset when describing an infinite dimensional true model, as predicted by the information-based approach. The infinite complexity of the true model is of no consequence to the selection of the objective w-prior.

Like subjective-Bayes inference, the w-prior generates a weighted ensemble of models rather than a point estimate or confidence intervals and therefore has both the associated advantages and shortcomings of the Bayesian machinery. A number of authors (e.g. [29]) have argued that the frequentist approach is itself ad hoc due to the bewildering proliferation of tests and statistics. These authors may view the w-prior, not as a tool for Bayesian statisticians, but rather as a missing unifying principle for frequentist methods to place them on par with existing Bayesian methods.

Cross-validation. Predictivity, cross-validation, bootstrapping and generalization error are all essentially mathematically equivalent measures of model performance [46]. Therefore, clearly the w-prior can be interpreted to be weighted to optimize cross-validation or generalization-error-based measures of performance. In fact, it has recently been formally demonstrated that stability to cross-validation is a necessary and sufficient condition for the predictive performance of a learning machine [30].

The central role of predictivity. Motivated by the work of Akaike, we have repeatedly made use of the principles of predictive performance. For instance, see Eqns. 11, 17 and 27. The critical consideration in each of these equations is always generalization: the model performance measured against data not included in the training set. It is this formulation that leads to model selection (or model regularization). For instance, if one were to define the multiplicity (Eqn. 11) with respect to the postdictive performance [47] or even the performance of the true model [48], the w-prior could not be consistently defined for an infinitely-nestable model since there would be an ultraviolet divergence [31] for high-complexity (large $K$) models. It is the use of the predictive performance as the model weighting that gives rise to regularization which is both natural and statistically-principled. It is unnecessary to augment the predictive regularization with exogenous and ad hoc regulatory devices such as smoothing [32], hyper-parameters [33] or vague priors [34].

The maximum-predictivity interpretation. Finally we wish to discuss our results in the context of subjective Bayes analysis where the true prior is known. We note the true prior is optimal in two senses. (i) The true prior maximizes the expectation of the performance estimator (e.g. [12]):

$$\overline{\mathcal{L}}_{\rm post}(\varpi_0,\varpi) \equiv \mathop{E}_{\theta \sim \varpi_0} \mathcal{L}_{\rm post}(\theta,\varpi), \tag{26}$$

which has a unique global maximum at $\varpi = \varpi_0$. But with particular relevance to our current work, the true prior is also optimal in a predictive sense. (ii) The true prior maximizes the expectation of the predictivity (e.g. [12]):

$$\overline{\mathcal{L}}_{\rm pre}(\varpi_0,\varpi) \equiv \mathop{E}_{\theta \sim \varpi_0} \mathcal{L}_{\rm pre}(\theta,\varpi), \tag{27}$$

which has a unique global maximum at $\varpi = \varpi_0$. It is tempting to ask whether there is some prior that optimizes predictivity directly if the true prior is not known, in analogy to Eqn. 27, but no such prior exists [12, 23].

Discussion. We have presented predictive performance as a unified framework for reconciling the three principal paradigms of statistical inference [1, 2]. As discussed in the previous section, subjective-Bayesian analysis can be understood to directly maximize the predictive performance of the model if (and only if) the true prior is used to generate inference.

But by far the most important practical scenario is an unknown true prior. In this case, the predictive performance of the model cannot be strictly maximized since the true prior is unknown. In this scenario, we propose that inference be performed by weighting models by their expected predictive performance using the w-prior. As we have demonstrated, the w-prior results in a partition function which is the unbiased estimator of model performance. Therefore inference using the w-prior can also be understood as the optimization of predictive performance, although not in the strict sense of maximization.

The w-prior is improper and yet has a rigorous statistical meaning. There has been a long history of the successful application of improper priors in statistical analysis, most famously by H. Jeffreys [25], despite considerable, ongoing and heated debate about the statistical meaning and rigor of these approaches [33, 35-37]. We have proposed one possible rigorous definition for an improper prior and have described why such priors have good performance from a frequentist perspective.

The w-prior also has a natural interpretation as an uninformative prior. (i) It is a realization of the Principle of Indifference in the sense that the prior weights all distinguishable models equally. Since this scenario describes a state of maximum entropy, the w-prior has a MaxEnt interpretation. (ii) It is uniformly uninformative in the sense that the parameter-information content of the observations is independent of the true model. (iii) It is maximally uninformative in the sense that it maximizes the model-averaged parameter-information content of observations and can therefore be interpreted as a reference prior. Finally we demonstrated that the w-prior is equivalent to information-based AIC inference for regular models. The w-prior has virtually all of the desirable properties of an objective-Bayesian prior with one key shortcoming: Both the prior and the posterior have a weighting rather than a Bayesian probabilistic interpretation. We believe that such an interpretation is not only philosophically desirable but a mathematical necessity.